MACH1: nonuniform time-scale modification of speech

Authors

  • Michele Covell
  • Margaret Withgott
  • Malcolm Slaney
Abstract

Time-compression techniques change the playback rate of speech without introducing pitch artifacts. However, when linear-compression techniques are used, human comprehension of time-compressed speech typically degrades at compression rates above two times real time [1]. These degradations are not due to the speech rate per se: Comprehension of linearly compressed speech often breaks down above 225 to 270 words per minute (wpm) [2], which is well below the rates at which long passages of natural speech are comprehensible (up to 500 wpm) [3]. Instead, the incomprehensibility of time-compressed speech is due to its unnatural timing.

Mach1, described in Section 2, is an alternative to linear time compression. Mach1 compresses the components of an utterance to resemble closely the natural timing of fast speech. Section 3 describes our test of comprehension and preference levels for Mach1-compressed and linearly compressed speech. In Section 4, we draw our conclusions.

2 MACH1 TIME COMPRESSION

Mach1 mimics the compression strategies that people use when they talk fast in natural settings. We used linguistic studies of natural speech [4,5] to derive these goals:

  • Compress pauses and silences the most
  • Compress stressed vowels the least
  • Compress unstressed vowels by an intermediate amount
  • Compress consonants based on the stress level of the neighboring vowels
  • On average, compress consonants more than vowels

Also, to avoid obliterating very short segments, we need to avoid overcompressing already rapid sections of speech.

Unlike previous techniques [6,7], Mach1 deliberately avoids categorical recognition (such as silence detection). Instead, as illustrated in Figure 1, it estimates continuous-valued measures of local emphasis and relative speaking rate. Together, these two sequences estimate what we call audio tension: the degree to which the local speech segments resist changes in rate. High-tension segments are less compressible than low-tension segments.
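The idea that high-tension segments are compressed less than low-tension ones can be pictured with a small sketch. This is only an illustration under stated assumptions, not the mapping used by Mach1: the function name `local_rates`, the exponential form, and the `strength` parameter are all hypothetical. The sketch takes a per-frame tension sequence and a global target rate, and returns per-frame rates that are lower where tension is high while keeping the overall output duration at the global target.

```python
import math

def local_rates(tension, target_rate, strength=1.0):
    """Turn per-frame tension into per-frame compression rates.

    Hypothetical mapping for illustration only; the excerpt does not
    give the formula that Mach1 actually uses.
    """
    mean_t = sum(tension) / len(tension)
    # Higher tension -> smaller rate (less compression). The exponential
    # is an assumed choice that keeps every rate positive.
    raw = [target_rate * math.exp(-strength * (t - mean_t)) for t in tension]
    # Each input frame contributes 1/rate frames of output. Rescale so the
    # total output duration equals len(tension) / target_rate.
    out_duration = sum(1.0 / r for r in raw)
    wanted = len(tension) / target_rate
    scale = out_duration / wanted
    return [r * scale for r in raw]

# Example: a pause (low tension), two unstressed frames, one stressed frame.
rates = local_rates([0.0, 0.3, 0.2, 1.0], target_rate=2.0)
```

The final rescaling step is one simple way to make the average compression match the requested rate; other normalizations would serve equally well for the purpose of illustration.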
Based on the audio tension, we modify the target compression rate to give local target compression rates. We use these local target rates to drive a standard time-scale modification technique (e.g., synchronized overlap-add [8]).

2.1 Local-Emphasis Measure

We use the local-emphasis measure to distinguish among silence, unstressed syllables, and stressed syllables. Emphasis in speech correlates with relative loudness, pitch variations, and duration [9]. Of these, relative loudness is the most easily estimated.

2.1.1 Estimating local energy

To estimate local emphasis, we first calculate the local energy. We simply use the frame energies from the spectrogram used in speaking rate estimation (see Section 2.2).

2.1.2 Normalizing by …
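The local-energy step of Section 2.1.1 (reading frame energies off a spectrogram) can be sketched as follows. The excerpt does not state the spectrogram parameters, so the Hann window, the frame length, the hop size, and the function names `power_spectrogram` and `frame_energies` are assumptions made for illustration. The naive DFT is used only to keep the example self-contained.

```python
import cmath
import math

def power_spectrogram(signal, frame_len=64, hop=32):
    """Naive Hann-windowed DFT; returns squared magnitudes, one row per frame.

    Window type, frame_len and hop are assumed values, not taken from the paper.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        x = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins
            s = sum(x[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            row.append(abs(s) ** 2)
        rows.append(row)
    return rows

def frame_energies(power_spec):
    """Local energy per frame: sum of squared magnitudes across bins."""
    return [sum(row) for row in power_spec]

# Example: 128 samples of silence followed by 128 samples of a sine tone.
silence = [0.0] * 128
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(128)]
energies = frame_energies(power_spectrogram(silence + tone))
```

In this example the frames that cover only silence have zero energy and the frames inside the tone have much larger energy, which is the contrast a loudness-based emphasis measure relies on.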

Similar resources

Mach1 for Nonuniform Time-scale Modification of Speech: Theory, Technique, and Comparisons

We propose a new approach to nonuniform time compression, called Mach1, designed to mimic the natural timing of fast speech. At identical overall compression rates, listener comprehension for Mach1-compressed speech increased between 5 and 31 percentage points over that for linearly compressed speech, and response times dropped by 15%. For rates between 2.5 and 4.2 times real time, there was ...

Modification of Audible and Visual Speech

Speech is one of the most common and richest methods that people use to communicate with one another. Our facility with this communication form makes speech a good interface for communicating with or via computers. At the same time, our familiarity with speech makes it difficult to generate synthetic but natural-sounding speech and synthetic but natural-looking lip-synced faces. One way to reduc...

Variable time-scale modification of speech using transient information

Conventional time-scale modification methods have the problem that, as the modification rate gets higher, the time-scale-modified speech signal becomes less intelligible, because they ignore the effect of articulation rate on speech characteristics. In this paper, we propose a variable time-scale modification method based on the knowledge that the timing information of transient portions of a spe...

Effects of Pitch Contours Stylization and Time Scale Modification on Natural Speech Synthesis

This paper describes a method for generating intonated speech for natural speech synthesis using a prosody generation model. It discusses the effect of pitch modification through pitch contour stylization for parameter extraction, and of time-scale modification for its implementation. An approach for close-copy syllabic stylization is described. In the latter part, algorithm for impl...

A Speaking Rate Normalization Method Using Time-Scale Modification for Speech Recognition

In this paper, we propose a speaking rate normalization method by selecting a scaling factor of time-scale modification for speech recognition. It is shown from the speech recognition experiments that the proposed method reduces average word error rate compared to that without using any speaking rate normalization.


Publication date: 1998